
Support GigaChat3 #995

Merged

ikawrakow merged 3 commits into main from ik/support_gigachat
Nov 24, 2025

Conversation

@ikawrakow
Owner

This PR adds support for GigaChat3 and closes #994

The model uses the same MLA attention mechanism as DeepSeek, but with a twist: the value head length is not 128 as in DeepSeek models, but 192. I guess everybody feels the need to make a creative alteration to an existing architecture.
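A minimal shape sketch of why this works at all: in attention, the output width simply follows the value head size, so nothing forces it to match the q/k head size. The 128-vs-192 value sizes come from the PR text; the q/k head size (128 nope + 64 rope, DeepSeek-style) is an assumption for illustration, not read from the GigaChat3 config.

```python
import numpy as np

# Sketch: attention where the value head size differs from the q/k head
# size, as in GigaChat3's MLA variant. Dimensions other than d_v are
# assumed here purely for illustration.
def attention(d_v, n_head=4, n_tok=8, d_qk=128 + 64, seed=0):
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_head, n_tok, d_qk))
    k = rng.standard_normal((n_head, n_tok, d_qk))
    v = rng.standard_normal((n_head, n_tok, d_v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_qk)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v  # output width is d_v, independent of d_qk

out_deepseek = attention(d_v=128)   # DeepSeek value head size
out_gigachat = attention(d_v=192)   # GigaChat3's altered value head size
assert out_deepseek.shape == (4, 8, 128)
assert out_gigachat.shape == (4, 8, 192)
```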

Here are some sweep-bench results for the 10B-A1.8B variant (https://huggingface.co/ai-sage/GigaChat3-10B-A1.8B-bf16) quantized as Q8_0.

ik_llama.cpp, RTX-4080

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 2048 | 256 |     0 |  0.167 | 12279.50 |  1.131 |   226.45 |
| 2048 | 256 |  2048 |  0.181 | 11321.48 |  1.159 |   220.96 |
| 2048 | 256 |  4096 |  0.226 |  9059.58 |  1.199 |   213.53 |
| 2048 | 256 |  6144 |  0.272 |  7531.24 |  1.231 |   207.89 |
| 2048 | 256 |  8192 |  0.317 |  6452.18 |  1.348 |   189.97 |
| 2048 | 256 | 10240 |  0.364 |  5619.66 |  1.380 |   185.54 |
| 2048 | 256 | 12288 |  0.409 |  5009.50 |  1.383 |   185.10 |
| 2048 | 256 | 14336 |  0.455 |  4499.59 |  1.388 |   184.42 |
| 2048 | 256 | 16384 |  0.500 |  4092.72 |  1.476 |   173.42 |
| 2048 | 256 | 18432 |  0.549 |  3729.05 |  1.511 |   169.48 |
| 2048 | 256 | 20480 |  0.596 |  3435.35 |  1.521 |   168.27 |
| 2048 | 256 | 22528 |  0.646 |  3168.05 |  1.521 |   168.28 |
| 2048 | 256 | 24576 |  0.695 |  2947.77 |  1.606 |   159.37 |
| 2048 | 256 | 26624 |  0.743 |  2757.12 |  1.644 |   155.68 |
| 2048 | 256 | 28672 |  0.796 |  2572.42 |  1.651 |   155.07 |
| 2048 | 256 | 30720 |  0.838 |  2444.78 |  1.654 |   154.81 |

llama.cpp, RTX-4080

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 2048 | 256 |     0 |  0.268 |  7650.07 |  1.285 |   199.20 |
| 2048 | 256 |  2048 |  0.335 |  6120.87 |  1.325 |   193.18 |
| 2048 | 256 |  4096 |  0.444 |  4614.00 |  1.355 |   188.91 |
| 2048 | 256 |  6144 |  0.555 |  3688.46 |  1.395 |   183.52 |
| 2048 | 256 |  8192 |  0.654 |  3131.34 |  1.435 |   178.44 |
| 2048 | 256 | 10240 |  0.739 |  2770.27 |  1.575 |   162.50 |
| 2048 | 256 | 12288 |  0.832 |  2461.58 |  1.597 |   160.33 |
| 2048 | 256 | 14336 |  0.946 |  2165.80 |  1.610 |   159.01 |
| 2048 | 256 | 16384 |  1.045 |  1960.65 |  1.625 |   157.50 |
| 2048 | 256 | 18432 |  1.127 |  1816.52 |  1.637 |   156.40 |
| 2048 | 256 | 20480 |  1.238 |  1654.25 |  1.762 |   145.30 |
| 2048 | 256 | 22528 |  1.335 |  1533.68 |  1.775 |   144.23 |
| 2048 | 256 | 24576 |  1.435 |  1427.38 |  1.790 |   143.05 |
| 2048 | 256 | 26624 |  1.529 |  1339.09 |  1.797 |   142.47 |
| 2048 | 256 | 28672 |  1.615 |  1268.26 |  1.812 |   141.32 |
| 2048 | 256 | 30720 |  1.715 |  1193.94 |  1.936 |   132.20 |
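Reading the end-points off the two GPU tables above, the speedup ratios can be computed directly (a quick sanity-check script, not part of the PR):

```python
# Speedups of ik_llama.cpp over llama.cpp on the RTX-4080, taken from
# the first and last rows of the two tables above.
s_pp = {"N_KV=0": (12279.50, 7650.07), "N_KV=30720": (2444.78, 1193.94)}
s_tg = {"N_KV=0": (226.45, 199.20), "N_KV=30720": (154.81, 132.20)}

for name, table in (("PP", s_pp), ("TG", s_tg)):
    for kv, (ik, mainline) in table.items():
        print(f"{name} speedup at {kv}: {ik / mainline:.2f}x")
# PP goes from ~1.6x with an empty cache to ~2x at 30k tokens.
```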

ik_llama.cpp, CPU-only, Ryzen-7950X

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 2048 | 128 |     0 |  2.373 |   862.94 |  4.032 |    31.75 |
| 2048 | 128 |  2048 |  3.546 |   577.61 |  4.295 |    29.80 |
| 2048 | 128 |  4096 |  4.769 |   429.44 |  4.487 |    28.53 |
| 2048 | 128 |  6144 |  6.052 |   338.42 |  4.671 |    27.40 |
| 2048 | 128 |  8192 |  8.031 |   255.01 |  4.853 |    26.37 |
| 2048 | 128 | 10240 |  9.201 |   222.58 |  5.081 |    25.19 |
| 2048 | 128 | 12288 | 10.847 |   188.81 |  5.229 |    24.48 |
| 2048 | 128 | 14336 | 12.062 |   169.79 |  5.478 |    23.37 |
| 2048 | 128 | 16384 | 13.330 |   153.64 |  5.643 |    22.68 |

llama.cpp, CPU-only, Ryzen-7950X

|   PP |  TG |  N_KV |  T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|--------:|---------:|-------:|---------:|
| 2048 | 128 |     0 |  13.024 |   157.24 |  4.145 |    30.88 |
| 2048 | 128 |  2048 |  26.384 |    77.62 |  4.870 |    26.28 |
| 2048 | 128 |  4096 |  40.148 |    51.01 |  5.686 |    22.51 |
| 2048 | 128 |  6144 |  53.378 |    38.37 |  6.513 |    19.65 |
| 2048 | 128 |  8192 |  66.855 |    30.63 |  7.294 |    17.55 |
| 2048 | 128 | 10240 |  80.105 |    25.57 |  8.129 |    15.75 |
| 2048 | 128 | 12288 |  93.748 |    21.85 |  9.104 |    14.06 |
| 2048 | 128 | 14336 | 107.374 |    19.07 |  9.801 |    13.06 |
| 2048 | 128 | 16384 | 121.745 |    16.82 | 10.743 |    11.92 |
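The CPU gap is much larger, and widens with cache depth. From the first and last rows of the two Ryzen-7950X tables above (same sanity-check style as for the GPU numbers):

```python
# Speedups of ik_llama.cpp over llama.cpp on the Ryzen-7950X,
# computed from the two CPU tables above.
pp = {"N_KV=0": (862.94, 157.24), "N_KV=16384": (153.64, 16.82)}
tg = {"N_KV=0": (31.75, 30.88), "N_KV=16384": (22.68, 11.92)}

for name, table in (("PP", pp), ("TG", tg)):
    for kv, (ik, mainline) in table.items():
        print(f"{name} speedup at {kv}: {ik / mainline:.2f}x")
# PP is ~5.5x faster with an empty cache and ~9x at 16k tokens;
# TG grows from roughly parity to ~1.9x.
```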

@Nexesenex
Contributor

Gigachad! :D

Note: Rope cache doubles the perplexity when used.

@Nexesenex
Contributor

Also, quantized K cache might need to be adjusted:

If I use:

```
llama-perplexity -m GigaChat3-10B-A1.8B-Q8_0.gguf -mg 2 --override-kv deepseek2.expert_used_count=int:4 -c 512 -mqkv -gr -ctk q8_0 -ctv q8_0 --host 127.0.0.1 --port 8080 -f wiki.test.raw
```

I get:

```
perplexity: tokenizing the input ..
perplexity: tokenization took 325.438 ms
perplexity: calculating perplexity over 610 chunks, n_ctx=512, batch_size=2048, n_seq=4
CUDA error: invalid argument
current device: 0, in function ggml_backend_cuda_buffer_set_tensor at Q:\GitHub\ik_llama.cpp.fks\ggml\src\ggml-cuda.cu:557
cudaMemcpyAsync((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, ((cudaStream_t)0x2))
Q:\GitHub\ik_llama.cpp.fks\ggml\src\ggml-cuda.cu:123: CUDA error
```

@ubergarm
Contributor

@Nexesenex

I'm only testing on CPU, where I'm not seeing that error (which perhaps makes sense, given it looks like a CUDA-path issue).

I don't even know what -mg 2 is, but I tried various combinations of:

```
    -mg 2 \
    -mqkv \
    -ger \
    --override-kv deepseek2.expert_used_count=int:2 \
    -ctk q8_0 \
```

I didn't use -ctv q8_0 given this is MLA attention, but it seems okay. It does get "dumber" going down to only 2 experts, haha...

@Nexesenex
Contributor

Nexesenex commented Nov 21, 2025

@ubergarm: yeah, I tested with 3 experts and the PPL is not so bad. Time for inference now!

-mg sets the main GPU, a relic of early llama.cpp versions used to select the GPU for single-GPU inference, or the KV-cache destination with row split. I never knew whether anything else was meant by this option, so I still set it to my fastest GPU.

As for ctv, I always forget to remove it because it's either used or irrelevant. :D

@magikRUKKOLA

@Nexesenex

Note: Rope cache doubles the perplexity when used.

For the K2-Thinking too :)

@ikawrakow ikawrakow merged commit f119103 into main Nov 24, 2025

Development

Successfully merging this pull request may close these issues.

GigaChat3 models check_tensor_dims has wrong shape

4 participants